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ABSTRACT 



We introduce Locally Linear Embedding (LLE) to the astronomical commu- 
nity as a new classification technique, using SDSS spectra as an example data set. 
LLE is a nonlinear dimensionality reduction technique which has been studied in 
the context of computer perception. We compare the performance of LLE to well- 
known spectral classification techniques, e.g. principal component analysis and 
line-ratio diagnostics. We find that LLE combines the strengths of both methods 
in a single, coherent technique, and leads to improved classification of emission- 
line spectra at a relatively small computational cost. We also present a data 
subsampling technique that preserves local information content, and proves ef- 
fective for creating small, efficient training samples from a large, high- dimensional 
data sets. Software used in this LLE-based classification is made available. 



Introduction 



With the development of large-scale, astronomic al photometric a nd spectroscopic 



surveys such as t he Sloan Digita 



Survey (2MASS; 



Skrutskie et al 



Response System (PanSTARRS; 



Sky 



Survey (SDSS; 



York et al. 



2000), the Two Micron 



2006) , the Panora mic Survey Telescope and Rapid 



Kaiser et al. 



20021 ). and the Canada France Hawaii 



Telescope Legacy Survey (CFHTLS), the amount of data available to astronomers has 
increased dramatically. While these data sets enable much new science, we face the question 
of how we extract information efficiently from data sets that are many terabytes in size and 
that contain a large number of measured attributes. In other words, how do we classify 
objects in a physically meaningful way when the data span a large number of dimensions 
and we do not know a priori which of those dimensions are important? 
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The question of the classification of astronomical data remains open. We can, however 
draw upon a number of successful classification schemes, such as Hubble's morphological 
classification of galaxies and Morgan's spectral classification of stars, and note the 
attributes they have in common. First, classifications must be simple: Hubble's initial 
classification into seven different types remains the dominant scheme used in astronomy 
despi te numerous extensions based on subtypes (e.g. 



de Vaucouleurs 



976) , concentra tion indices and a symmetry parameters ([Abraham et al. 



Conselice 



20031 ). Gini coefficients (jAbraham et al. 



1959 



van den Bergh 



19941 ) 



, dumpiness 



20031 ). and neural networks (ILahav et al. 



19961 ). Second, they must be physically meaningful: the Hubble morphological scheme 
relates to the dynamical history of the galaxy. Finally, the number of classes must account 
for the intrinsic uncertainty in the properties of the observed sources (i.e. we do not want 
to over-fit the data if the scatter in the classifications is large). 

These principles can be applied to spectral data, which have the advantage of being 
easier to map to physical properties of sources than images. Continuum shapes, emission 
line ratios, and absorption indices ca n all be used as prox i es for star formation r ate, stellar 



age, dust absorption and metallicity (IHeavens et al. 



2000 



Reichardt et al. 



200 ll ). As such, 



spectral classification, and in particular statistical techniques for spectral classification, is 



more broadly us ed than morphological c 



based o n color ( 



indices; 



Cramer fe Maeder 



Gulati et al 



1979 



assification. Examp les of this include classifications 



Kunzli et al 



19991 ). model fitting flBruzual fc Chariot 

"X" 



1997), 



spectr al indices (e.g. Lick 



non-parametric tech niques (IConnolly et al. 
Richards et allLoOfl ). 



199E 



Folkes et al. 



2003), and more gener al 



1996 



Yip et al. 



2004a 



The difficulty that spectral classification poses comes from the inherent size and 
dimensionality of the data. For example, the SDSS spectroscopic survey comprises over 
1.5 million spectra, each with 4000 spectral elements covering 3800A to 9800A. Providing 
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a simple and physical classification requires that we reduce the dimensionality of the data 
in a way that preserves as many of the physical correlations as possible. A number of 
techniques have been developed. One such approach that has met with success is the 
Karhunen-Loeve (KL) Transform or Principal Component Analysis (PCA). PCA is a linear 



projection of data onto a set of orthogonal bas i s functions 



applied to galaxy spectra (IConnollv et al. 



Bromley et al. 



1998 



Galaz fc de Lapparent 



995 



eigenvectors). It has been 



199 



Folkes et al 



Ronen et al 



19961: IS. 



Sodre fc Cuevas 



Folkes et al 



s 1997 
19991: 



Yip et al. 



Yip et al. 



2004a; 



Budavari et al 



2009) , QSOs ( 



2004bl ). stellar spectra flSingh et al 



Francis et al. 



measured attributes of stars and galaxies (lEfstathiou fc F all 



1998 



1992 



Boroson fc Green 



Bailer- 



ones et al 



1992 



19981 ) as well as to 



1984J ). T he PCA projection fo r 



galaxy spectra has bee n shown to correlate with star formation rate (jConnolly et al. 



1995 



Madgwick et al. 



2003d ) and is now actively used as a classification scheme in the SDSS and 



2dF spectroscopic surveys. 

As we will discuss within this paper, the linearity of the KL transform, while providing 
eigenspectra that are relatively simple to interpret, has an underlying weakness. It cannot 
easily (nor compactly) express inherently non-linear relations within the data (such as dust 
obscuration or the variation in spectral line widths). Spectra that have a broad range in 
line-widths require a large number of eignespectra to capture their intrinsic variance which 
often results in the continuum emission and emission lines being treated independently 



(IGyory et al 



20081 ). In this paper we consider a new approach to classification, using a 
dimensionality reduction techni que that preserves th e local structure within a data set: 



Local Linear Embedding (LLE; iRoweis fc Saulll2000l ). In Section [2] we will discuss the 



properties of spectral classification using Principal Component Analysis and line ratios. In 
Sections [3] and H] we introduce LLE and apply it to the classification of spectra from the 
SDSS. In Section [5] we consider some of the practical details to consider when applying the 
LLE procedure. In Section [6] we discuss the utility of the LLE as a dimensionality reduction 
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scheme as well as its performance relative to classical spectral classification techniques. 



2. Dimensionality Reduction and Classification of Spectral Data 



2.1. Principal Component Analysis 



Over the last decade PCA or KL classification has been appli ed to a wide range o f 
spectral classification problems. Fro m non-parametric clas sification (jConnolly et al.lll995l ) to 



extracting star formation properties ( 



SDSS spectra (IMadgwick et al. 



2003b 



erreras et al. 



20061 ) to identifying supernovae within 



Krughoff et al. in prep), the many applications of 
PCA have demonstrated the importance of dimensionality reduction in classification. This 
technique seeks to decompose a dataset into a linear combination of a small number of 
principal components, such that each principal component describes the maximum possible 
variance. 

S(\) = J2a l e l (\) (1) 

i 

where 5(A) is the spectrum to be projected, dj are the expansion coefficients, and e«(A) the 
orthogonal eigenspectra. 

Schematically, PCA is equivalent to finding a best-fit linear subspace to the entire 
dataset, s u ch tha t the variance of the data projected into this subspace is maximized. 



Yip et al. 



(I2004al ) present a robust PCA approach to spectral classification of the SDSS 
spectra. In it, they find that the information within the SDSS main galaxy sample can be 
almost completely characterized by the first dozen eigenmodes. This represents a reduction 
in data size by a factor of nearly 400 over the complete spectra. Furthermore, the coefficients 
of the first three eigencomponents correlate with physical properties of the galaxies such 
as star- formation rate, and post star-burst activity. In fact, PCA decomposition can lead 
to a single parameter description of galaxy spectra: the mixing angle between the first 
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and second eigencoef ficients. This mixing 



galaxy spectral type ([Connolly et al. 



1995 



angle has been s 



lown to correlate well with 



Madgwick et al 



2003af ) and can be used for an 



approximate separation between quiescent and emission-line galaxies. 



For continuum emission PCA has a proven record in representing the variation in the 
spectral properties of galaxies. It does not, however, perform well when reconstructing 
high-frequency structure within a spectrum (i.e. the distribution of emission lines, 
lines-ratios, and line- widths). This is due to two effects: principal components preserve 
the total variance of the system, i.e. the overall color of the spectrum, and principal 
components express linear relations between components. High frequency, or local, features 
do not contribute significantly to the total variance and are, therefore, not represented 
in the primary eigencomponents. Variations in line-widths and, often, line-ratios are also 
inherently non-linear in nature (due to the impact of dust or variations in the galaxy mass) 
and are not compactly repres ented as a linear combination of orthogonal components. This 



is shown in 



Yip et al. 



(l2004al ) where the majority of galaxies can be represented by three 



eigenspectra but emission-line galaxies require up to eleven components to express their 
variance. PCA performs poorly when distinguishing between emission line galaxies, e.g. 
separating broad-line QSOs from narrow-line QSOs or star-forming galaxies. 



2.2. Line-Ratio Diagrams 



198ll ) or Osterbrock 



Line- ratio diagrams, sometimes known as BPT plots (Ba ldwin et al. 

i — — inn. 

diagrams (jOsterbrock fc De Robertislll985l ). are one answer to the problems associated with 
PCA classification. Where PCA classifies based on the total variance within a spectrum, 
which weights the continuum emission more than individual emission lines, line-ratio 
diagrams are sensitive to emission-line characteristics while ignoring continuum properties. 



This makes them appropriate for diagnostics that distinguish between galaxies with narrow 
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emission lines. Line-ratio diagrams, however, do not account for continuum emission that 
might provide additional information on the physical state of a galaxy. They are ill-suited 
to classify non-emitting galaxies and galaxies with low signal-to-noise for individual lines, 
and fail completely for galaxies with certain types of emission, e.g. broad-line QSOs 
(often defined as those with emission line FWHM larger than ~ 1200 km s" 1 ). Line-ratio 
diagrams, then, are useful for the classification of only a small subset of observed objects. 

2.3. A Joint Approach to Spectral Classification 

Clearly, PCA and line-ratio diagrams serve complimentary purposes. PCA takes 
into account broad, low-frequency features, and can distinguish galaxies based on their 
continuum properties. Line-ratio diagrams, by design, depend only on the detailed features 
of emission lines, and therefore distinguish spectra based on their narrow, high-frequency 
features. If both types of information could be combined into a joint classification scheme, 
one that treats the spectrum as a whole rather than trying to isolate individual features, 
more insight may be gained into the physical state of a given object. This would be 
particularly helpful for automated classification within spectroscopic surveys, where a 
single, coherent technique could be used to identify objects which could then be analyzed 
further by specialized pipelines. The requirements of such a general classification are: it 
must be sensitive to both low and high frequency spectral information, it must be able to 
account for non-linear relationships between the spectral properties, it should be robust to 
outliers within the data and it should be able to express the properties of a galaxy in terms 
of a small number of physically motivated parameters. In this paper we consider Local 
Linear Embedding as a solution to these questions. 
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3. Locally Linear Embedding 



Locally Linear Embedding (LLE) is a nonlinear dimensionality reduction technique 
that seeks to find a low-dimensional projection of a higher di mensional dataset whi ch best 



20001). It can 



preserves the geometry of local neighborhoods within the data (IRoweis fc Saull l: 
be thought of as a non-linear generalization of PCA: rather than projecting onto a global 
linear hyper-surface, the projection is to an arbitrary nonlinear surface constrained locally 
within the overall parameter space (see Figure [T]) . In this wa y it is superficially s imilar 



Einbeck et al 



2007, for an 



to, e.g. Non-Linear PCA and Principal Curves/Manifolds (see 
introduction to and astronomical application of these techniques). These generalizations 
seek a lower-dimensional projection which optimally represents the information contained 
in each point, such that the original data can be reconstructed from the projection. LLE, 
on the other hand, seeks a projection which optimally represents the relationship between 
nearby points. Accurate reconstruction of data is possible to an extent (see section 
but is not the primary goal of an LLE projection. 

The LLE algorithm consists of two parts: first, a set of local weights is found which 
parametrize the linear reconstruction of each point from its nearest neighbors; second, a 
global minimization is performed to determine a lower- dimensional analog of the dataset 
which best preserves these local reconstruction weights. 



3.1. Description of the LLE Algorithm 

For notational conventions, refer to Appendix |A] We initially consider a series of 
points in a high dimensional space Xi (in the case of SDSS, each spectrum is represented 
by point in a Di n = 4000 dimensional space) such that the set of these points, X, 
is given by X = [x , x 1; x 2 , ...xn], Xi G W Din . We map this to a lower dimensional 
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space Y = [yo, yi, ya, — Yn], Yi e R Dout , with An > £W For each point x ; , let 



n 



(0 



W (*) 
n\ ,n 2 , ...n 



(i)]T 
K J 



be the indices of the nearest neighbors, such that x^h is the jth 
[wf \ ■■■ w Jk] T represent the reconstruction 



nearest neighbor of Xi. Also, let 
weights associated with the K nearest neighbors of 

The key assumption is that each point Xi lies near a locally-linear, low- dimensional 
manifold, such that a tangent hyperplane is locally a good fit. If this is the case, a point 
can be accurately represented as a linear combination of its K nearest neighbors, with 
K > D out . The error in reconstruction can be thought of as the distance from this manifold 
to the point in question. A convenient form of this error is the reconstruction cost function 



K 



(*) 

3 n - ' 



3=1 



(2) 



By minimizing £[ (w^) subject to the constraint 



E 



w 



(0 



(3) 



we find the optimized local mapping. This local mapping has two important properties: first, 
it is invariant under scaling, translation, and rotation; second, it encodes the relationship 
between points of the local neighborhood in a way that is agnostic to the dimensionality 
of the parameter space, i.e., it equally well encodes the properties of the neighborhood in 
observational parameter space, as well as the properties of the neighborhood as projected 
onto the local tangent hyperplane. These two facts lie at the core of the LLE algorithm. 

Once the weights are determined for each point Xi, the same weights are used to 
determine the projected vectors yi which minimize the global cost function, 

N K 



i=l 



(*) 



(4) 



1 Nearest neighbors can be defined based on any general distance metric. In this work, 
Euclidean distance will be used unless otherwise noted. 
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Notice the symmetry inherent in equations [2] and HI Intuitively, the fact that the input data 
X and the projected data Y optimize these linear forms shows that the local neighborhoods 
of X and Y have similar properties. 



Computationally, each of these steps can be implemented 



linear algebra methods (See, for example, 



Roweis fc Saul 



2000 



efficiently using optimize d 



de Ridder fc Duin 



2002). 



We discuss algorithmic considerations in the Appendices [B], [C] and [D] 



3.2. Dimensionality Reduction: A Simple Example 

We provide a simple and classical example of the performance of the algorithm in 
FigurefU The first plot shows a particularly simple nonlinear test set: the "S-curve", with 
N = 2000 data points in three dimensions uniformly selected from the two-dimensional 
bounded manifold described by 

x = sin(#) 

> -1.5tt < 9 < 1.57T 
3 = ^(008(60-1)) J (5) 

< y < 5 

A linear dimensionality reduction such as PCA would not recover the inherently nonlinear 
shape of the two-dimensional manifold: it would find three principal axes within the data. 
An ideal nonlinear reduction would, in effect, unwrap this manifold, and map it to a flat 
plane which preserves local clustering - i.e. the colors would remain unmixed. 

The second plot in Figure [1] shows the two dimensional LLE projection applied to this 
data, with K = 15. It is clear that LLE, in this case, successfully recovers the desired 
2-dimensional embedding. Points of similar colors (which are clustered within the higher 
dimensional space) are clustered in the projected space. As we will discuss in Section 15.21 
the effectiveness of this projection will depend on the size of the region that we consider 
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local (i.e. the number of nearest neighbors) as well as the assumed dimensionality of the 
embedded manifold. 

The above test data densely sample a simply-connected manifold. In the case of 
sparsely sampled, or non-connected manifolds (e.g. due to missing data, survey detection 
limits, etc.), LLE is less successful at recovering the lower-dimensional projection. This 
issue can be addressed using various modifications to the algorithm (see, for example, 



Donoho Sz Grimes 



2003 



Chang fc Yeung 



2006 



Zhang fc Wang 



20071 ). some of which are 



implemented in the publicly available source code for this work (see Appendix lEl) . 



4. An Application of LLE to Spectral Classification 



As a concrete example of the utility of LLE for astronomical classification, we apply LL 



to sp ectra taken from the SDSS DR7 data release 



20081 ). A description of the S DSS can be f ound i n 



( Abazaiian 



York ct al 



fc Sloan Digital Sky Survey 



imaging survey descr ibed by 



Gunn et al. 



(120001 ) with the underlying 



Fukugita et al. 



(119981 ) and the photometric system by 



(119961 ). The subsample used in this analysis was selected using the technique 



described in section l5\3l 



This subsa mple comprises 871 1 total spectra from DR7, with 6930 from the main 



galaxy sample (IStrauss et al. 



20021 ) and 1781 from the QSO sample ( ISchneider et al.ll200Sl ). 
with redshifts z< 0.36. All data have been shifted to a common rest-frame, covering a 
spectral range of 3830 A to 9200 A, resampled to 1000 logarithmically-spaced wavelength 
bins, corrected for significant sky absorption, and normalized to a constant total flux. 
Emission-line and absorption-line equivalent widths have been extracted from the DR7 
FITS headers. Using these data we further subdivide these spectra based on the automated 
classifications of the SDSS spectroscopic pipeline. Of the objects with Hydrogen emission 
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-0.04 -002 0.00 0.02 0.04 
x 



Fig. 1. — The canonical "S-curve" test set for nonlinear data-reduction, top: N = 2000 
points randomly sampled from a bounded 2-manifold (eqn. [5]) embedded in 3-space. Points 
are colored to highlight the connection of local neighborhoods, bottom: the 2 dimensional 
LLE projection of this manifold, with K = 15. Note that points which are clustered in 
the 3-dimensional parameter space are also clustered in the 2-dimensional embedding space. 
This shows that LLE reconstructs the desired embedding. 
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line-strengths greater than three times the spectral noise, 888 objects are classified as broad 
emission line QSOaj (where broad means a line-width of > 1200 km s" 1 ), 893 are defined 
as narrow emission line QSOs, and 6068 are defined as emission-line galaxies (emission-line 
gala xies and narrow-line QSOs are distinguished using the [Nil] /Ha line-ratio diagnostic 



from 



Kewley et al. 



(120011 )). Of the remaining galaxies, 523 are classified as absorption-line 
galaxies (Balmer absorption strengths greater than 3a), and 339 are classified as quiescent 
galaxies (Balmer emission strengths less than 3a). 

Figure [2] shows the LLE projection of the SDSS spectra from the original 1000 
dimensional space to a three dimensional subspace. To minimize the effect of outliers, we 
use a robust variant of LLE (see section [5TTT) . We plot the position of all of the sources 
within this subspace and color-code the points based on the above divisions. Red points are 
broad-line QSOs, orange points represent narrow-line QSOs, green points are emission-line 
galaxies and light and dark blue points are galaxies with continuum emission and strong 
absorption line systems respectively. 

The axes of an LLE projection are uniquely defined up to a scaling and a global 
rotation. The interpretation of the projection comes, therefore, from the relationship 
between the points rather than from the axes themselves. In Figure [2] it is clear that 
there are a few well-defined branches in the resulting diagram that separate the sources 
based on their spectral properties. QSOs (broad and narrow line), emission-line galaxies 
and continuum galaxies separate in to three basic clusters or classifications. Within these 
clusters broad and narrow line QSOs can be distinguished as occupying separate parts of 
the subspace, emission line galaxies are distinct from absorption line systems while the 
strong absorption line systems closely track the subspace of the continuum galaxies. The 
positions of galaxies with low emission and absorption line strengths converge to a common 



2 Note: broad line QSOs in SDSS DR7 have SPEC_CLN=3 
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Fig. 2. — Robust LLE projection of the data subsample from 1000 down to three dimensions. 
Broad-line QSOs are identified by spectral classification within the SDSS pipeline, narrow- 
line QSQs and emi s sion g alaxies are distinguished using the [Nil] /Ha line-ratio diagnostic 



from 



Kewley et al 



(120011 ) . and emission or absorption refers to galaxies with line strengths 
at three times the continuum noise. The dashed boxes indicate the "broad-line QSO region" 
and the "emission galaxy region" discussed in section 14.21 
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region of the 3-dimensional space, which is to be expected as galaxies do not represent a 
series of individual classifications but rather a continuum of properties. 

For the remainder of the paper, all analysis will be done based on the two-dimensional 
manifold defined by the first and third components (i.e. the larger plot in Figure [2]). 
This sub-projection gives maximal discrimination between the QSO and emission galaxy 
branches (for further discussion of this choice of projection, see section [572]) . 

4.1. Interpretation of the LLE Tracks 

The distance of a galaxy along any one of the branches within Figure [2] correlates 
well with its physical characteristics (as reflected in their spectra). Figures [3] through [7] 
show progressions along various regions, along with the mean spectrum of each labeled 
region. When calculating the mean, we exclude excessively noisy objects, identified as those 
with flux -F(A) satisfying J |F(A)|<iA > 1.5 J F(X)d\. Because the spectra are preprocessed 
with a normalization, the few spectra satisfying this inequality would otherwise contribute 
disproportionately to the mean spectrum in each region. 

Figure [3] shows the progression of spectra along the emission galaxy track. To do this 
we calculate the mean spectrum for galaxies contained within the regions associated with 
the bounding boxes El through E5. As we will show later, this sequence not only tracks 
increasing line strengths, but also changing line ratios (see Figure [T2|) . 

Figure H] shows the progression of spectra along the broad-line QSO track. We find 
that, as we move from Ql to Q5, the spectral type of the QSO evolves from a Seyfert 1.9 to 
a classic broad line QSO. As would be expected for such a transition the Nil emission line 
strength decreases while the B./3 line strength increases as the spectra become progressively 
dominated by the accretion disk emission. Associated with these emission line properties 
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we find that the continuum slope of the QSO spectra are bluer for the Q5 class than for 
Ql. This not only supports the classification of the spectra from Seyfert 1.9 to broad-line 
AGN, it also demonstrates how LLE can use jointly the information contained within the 
continuum and line emission. 

Figure [5] shows the progression of spectra along the narrow-line QSO track. As with 
the emission-line galaxy track, the narrow QSO track traces increasing emission from Nl to 
N5. N1-N3 have narrow H/? and Ha, with line-ratios that indicate power-law ionization, 
and are spectrally consistent with LINERs. N3-N5 have increasingly broad Ha, which 
suggests classification as Seyfert 1.9-2.0. Line-ratio diagnostics confirm this (see Figure [T2"j) . 

In Figure [6] we consider the progression of spectra across the different spectral groupings 
(from broad line QSO to galaxies with narrow emission lines) as opposed to along an 
individual track. The X3 sources are, as before, broad-line QSOs with strong H/3 emission. 
As we transition to X2 the width of the emission line decreases until X2 has the spectral 
characteristics of a Seyfert type 1.5. By XI the spectral properties are consistent with 
an HII region or narrow emission line galaxy (for wavelengths < 5500 A) while showing 
evidence for an AGN component in the broad line Ha. This one parameter sequence from 
broad-line QSO to HII region spectra (including regions where the populations are mixed) 
is described very naturally with this LLE projection. 

The last subsample we consider are the quiescent galaxies in Figure [7J The transition 
from Gl to G5 tracks the same spect ral classification fr om quiescent to active star formation 



as found using the PCA analysis of 



Yip et al. 



(J2004aj). In terms of 



spectral p r opert ies 



galaxies classified as Gl correspond to the E/SO spectral type from iKennicuttl (119981 ) and 
galaxies in G5 correspond to the Sbc/Sc Kennicutt spectrum. Along this progression the 
emission line strengths for Ha and Nil increase and the strength of the MgUI absorption 
line decreases. It is also noted that the depth of the break at 4000 A decreases with 



increasing index number. Comparisons with the results of lYip et al.l (l2004al ) indicate that 
this trajectory maps directly to the star formation rate within the galaxy (see Figure [TT]). 

From these initial comparisons it is clear that by projecting onto a three dimensional 
space using LLE we can differentiate between line-emitting galaxies and those dominated 
by continuum emission. We can also distinguish between sources as a function of their 
emission line characteristics. As such, LLE encapsulates both the line emission and 
continuum emission properties in its classification scheme. That LLE can express the 
variation in continuum properties of galaxies wh en projecting from 100 to 3 dimensions i s 



Connolly et al 



(1995) 



Yip et al. 



fl2004ah 



not altogether surprising as the PCA analyses of 
and others have demonstrated that quiescent galaxies occupy only a small region of the 
available parameter space and that the dominant direction in this space is governed by 
the star formation rate. For emission line galaxies, and in particular QSOs, standard 
PCA approaches require between 10 and 50 components to express the variation in line 
strengths as a function of galaxy type. This is due to the inherently non-linear nature of 
the reconstruction of emission lines. 

The fact that LLE can be used to separate both broad and narrow emission lines into 
separate groups or families, that the transition between narrow and broad line spectra is 
smooth, and that the position of a galaxy within each of these groups is dependent on the 
individual line strengths demonstrates the ability of LLE to concisely represent inherently 
non-linear systems in an intuitive manner. 



4.2. Separation of Spectral Types 



LLE is useful in that it maps a large-dimensional dataset to a smaller dimensional 
parameter space. Within this projected space, many familiar classification schemes can be 
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Fig. 3. — Progression of emission galaxy spectra. Each spectrum represents the mean of the 
spectra within the labeled region. Grey shading indicates the standard deviation about the 
mean. See caption of Figure [2] for description of colors. 
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Fig. 4. — Progression of broad-line QSO spectra. Each spectrum represents the mean of the 
spectra within the labeled region. Grey shading indicates the standard deviation about the 
mean. See caption of Figure [2] for description of colors. 
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Fig. 5. — Progression of narrow-line QSO spectra. Each spectrum represents the mean of 
the spectra within the labeled region. Grey shading indicates the standard deviation about 
the mean. See caption of Figure [2] for description of colors. 
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Fig. 6. — Progression from broad to narrow QSO spectra. Each spectrum represents the 
mean of the spectra within the labeled region. Grey shading indicates the standard deviation 
about the mean. See caption of Figure [2] for description of colors. 
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Fig. 7. — Progression of quiescent galaxy spectra. Each spectrum represents the mean of 
the spectra within the labeled region. Grey shading indicates the standard deviation about 
the mean. See caption of Figure [2] for description of colors. 
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applied (see lHastie et al.l 120011 . for a thorough introduction to the topic). In the case of 
SDSS spectra, LLE succeeds in mapping physically similar spectra to the same region of 
parameter space. More importantly, physically different spectra are mapped to distinct, 
well-separated regions, leading to the possibility of a very intuitive classification scheme. 

Because of this, we can quickly locate objects that are misclassified by both the 
SDSS pipeline and the line-diagnostic plots. Figure [8] displays a few examples of these. 
Ml is classified by the SDSS pipeline as a broad-line QSO, but falls on the emission-line 
galaxy track. This mis-classification is likely due to contamination of the Ha line from 
the neighboring Nitrogen features, leading to an erroneous large line-width. M2 is not 
recognized as an emitting object, because the Ha feature is missing. In fact, the entire 
narrow track on which M2 lies consists of objects which look like emission galaxies/type-2 
Seyferts, with the Ha line missing due to saturation of the lines, coincident sky lines, and 
bad pixels. LLE isolates these points on their own track of outliers. M3 and M4 fall on 
the broad-line QSO track, but are not recognized as broad emitters by the SDSS pipeline, 
probably due to high background noise. The LLE projection correctly groups these with 
other strong emitters. Because the LLE projection is entirely neighborhood-based, any 
given point will have similar properties to nearby points in the projection space. 

Next we compare the classification based on LLE to the classification by more 
traditional methods, e.g. emission line-widths, line-strengths, and line-ratios. This requires 
a choice of classification technique for points in the LLE-projection space. For illustrative 
purposes, we create a simple classification scheme and define two well-separated regions 
which correspond to physically different objects. The "broad-line QSO region" is defined 
as all points on the broad-line track with LLE component three greater than 0.02. The 
"emission galaxy region" is defined as all points on the emission galaxy track with LLE 
component one greater than 0.015 (see figure [2]). 
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Out of the 624 galaxies in the broad-line QSO region, 31 were classified by SDSS as 
having narrow line emission, and 11 were classified as quiescent. Through visual inspection, 
29 of the 31 and 1 of the 11 were found to show broad emission features, with the rest of 
the spectra too noisy to accurately classify. The SDSS pipeline, therefore, mis-classifies 
about 5% of their broad-line sources as narrow-line. 

Out of the 675 galaxie s in the emiss i on ga laxy region, 45 have line ratios on the 



narrow-line QSO side of the 



Kewley et al 



(120011 ) cutoff, two have Ha emission broader than 



1200km s 1 , and 4 have prominent emission features not detected by the SDSS pipeline. 

Combining the above tallies, we can estimate how successfully LLE clusters physically 
similar spectra. From a subsample which by design contains a high fraction of abnormal 
spectra (see Section I5T3"|) . less than one percent of the spectra were physically dissimilar from 
the average population in each region. These errors are due primarily to ver y noisy spectra . 



Kewley et al. 



for ot her cases, e.g. the 45 spectra where classifications based on LLE and 
20011 ) line diagnostics disagree, it is not immediately clear which classification is correct, if 
any The majority of these sources are transitional, falling near the dividing boundary of 
the line-ratio diagram. This dividing line is an arbitrary analytic separation between two 
regions of a certain subset of the parameter space. We note, however, that the LLE-based 
classification is without any a priori training. We do not preselect sources based on 
known properties nor fine-tune the parametrization of the classification, as opposed to the 
optimized SDSS pipelines. 

Figure M shows a sample of some significant outliers identified with LLE. 01-03 are 
outliers due to their noise. 04 and 06 appear to be sky spectra which were misclassified 
based on emission features. 05 appears to be a QSO with its Ha line missing, possibly 
due to atmospheric absorption. 07 has an artificial feature likely due to faulty sky-noise 
subtraction. In all of these cases, the spectrum's status as an outlier based on LLE reflects 
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its unusual spectral characteristics. 



5. Considerations when applying LLE 



5.1. Robust LLE: dealing with noisy and missing data 



As with any classification scheme it is important that we can account for missing and 
noisy data in a robust manner. As LLE is a locally weighted projection it is expected that 
outliers (even outliers for a single pixel) can impact the definition of the local neighborhood 
and might bias any classification scheme. 

In practice it is difficult to treat this problem completely as it essentially replaces the 
least squares analysis of each neighborhood with a Total Least Squares/ Errors in Variables 
analysis for each of the N points. These cannot be solved in closed-form, and re quire 



a computationally expensive iterative process (see 



van Huffel fc Vandewalle 



1991 



for a 



detailed exploration of the Total Least Squares problem). 



Chang fc Yeungj (120061 ) and address the 



We can, however, follow the methodology of 
problem of outliers using a Robust LLE (RLLE) technique. In RLLE, a reliability score for 
each data point is determined based on its efficacy in predicting the location of the points in 
its neighborhood. Outliers will have two general properties: first, they will not be members 
of large number of local neighborhoods; second, within the neighborhoods in which they 
appear, they will be far from the best-fit hyperplane. With this in mind, reliability scores 
are calculated using the following method: 



1. For e ach point, an iteratively reweighted least squares analysis ([Holland fc Welsch 
19771 ) is performed to determine optimal weights for a weighted PCA reconstruction 
of the point from its neighbors. Schematically, this is akin to finding the best-fit 
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Fig. 8. — A sample of objects with discrepant classifications See caption of Figure [2] for 
description of colors. 
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Fig. 9. — A sample of outliers from the LLE projection See caption of Figure[2]for description 
of colors. 
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hyperplane, weighting points based on their distance from that hyperplane, and 
repeating until the process converges. As a result of this process, each point in a local 
neighborhood is assigned a local weight which quantifies its contribution to the local 
tangent space. 

2. Each point may be a member of many neighborhoods. The reliability score for a point 
is determined by taking the sum of its local weights from each neighborhood of which 
it is a member. 

Thus, a higher score will be given to points which are part of many neighborhoods, and 
points which contribute more to the reconstruction in each neighborhood. With this in 
mind, outliers can be identified based on their low reliability score. 

In Figure 15.11 we compare the performance of LLE and RLLE on the S-curve dataset 
from section 13721 In this case we assign 15% of the points within the data set to be randomly 
selected from the full parameter space. The lower left panel shows the application of the 
standard LLE approach. As is clear from this Figure, outliers can change the local weights 
of the nearest neighbors and, thereby, distort the resulting lower dimensional manifold; 
causing the points of different colors to be heavily mixed. Effectively, the random noise 
within the point distribution fills in enough of the three-dimensional space that LLE cannot 
extract the lower dimensional manifold. For RLLE, as long as the number of random 
points is small relative to the number of nearest neighbors, we can isolate and exclude the 
contaminating points from the local weights. Figure 15.11 shows that we can recover the 
underlying two-dimensional manifold in the presence of noise and outliers through this 
approach. 
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X X 

Fig. 10. — Top: N = 2000 points randomly sampled from a bounded 2-manifold embedded 
in 3-space, with 15% of the points random outliers sampled from the entire parameter space. 
Bottom Left: the 2-dimensional LLE projection of this manifold, with k = 15. Note the 
mixing of the colors: the outliers obscure the 2-dimensional manifold to the point that LLE 
cannot recover it. Bottom Right: the 2 dimensional RLLE projection of this manifold, with 
k = 15, constructed using 80% of the points with the highest reliability scores. RLLE 
recovers the desired projection. 
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5.2. Choosing Optimal Parameters 

There are three free parameters that must be chosen for LLE: K, the number of 
nearest neighbors, D out , the output dimension, and r, the regularization parameter. K and 
D out will be discussed here, while r, which depends on details of the implementation, is 
addressed in appendix [B] 

Often in PCA, instead of specifying an output dimension D out , one chooses to specify 



a percentage of the sample variance which should be preserved. I n LLE, dimensiona 



ity ca n 



20021 ) 



be determined in an analogous way for each local neighborhood (Ide Ridder fc Duin 
In the computation of an LLE projection, a variant of the covariance matrix is computed 
for each local neighborhood. This matrix differs from the covariance matrix used in PCA 
by only a translation and scaling, neither of which affect the relative magnitudes of the 
eigenvalues. Taking the lead from PCA, we can specify a percentage of the local variance 
which we would like to conserve, and thereby find the minimum number of dimensions 
required locally, simply by doing an eigenanalysis of the local covariance matrix. Taking 
the mean of the iV local dimensionality determinations gives an estimate of the optimal 
value of D out for the global projection. For our SDSS analysis, the first three components 
contain an average of 75% of the local variance. The first ten components contain 92% of 



the local variance, si gnificantly 



PCA eigenspectra of 



Yip et al. 



ess than the 99.9% of total variance within the first ten 



(l2004al ). This discrepancy is likely due to noise and intrinsic 
variation at the local level, which contributes more to the small, local neighborhoods which 
LLE probes than to the overall variance that PCA probes. Also, our sampling technique 
(Section 15. 3ft chooses sources based on dissimilarity, leading to more intrinsic variation in 
the subsample. 

Choosing the number of nearest neighbors is more difficult. In fact, K need not 
be constant across all neighborhoods. It could be specified based on local properties of 
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the manifold or the density of sources at a given point. The literature on LLE offers no 
consensus on how to choose the optimal value of K. Too large a value, and the manifold 
is oversampled, destroying the assumption of local linearity, and leading to an inherently 
larger dimensional space. Too low, and the manifold is undersampled, which leads to a 
loss of information on the relationships between neighborhoods and more sensitivity to 
the presence of outliers and noise. In this work we take an empirical approach and try to 
estimate a series of heuristics that can be applied to the spectral data. We take a range of 
values for K (10 < K < 30) and determine our ability to maximally distinguish between 
galaxy populations in the resulting LLE classifications. In the case of galaxy spectra, the 
K is chosen which maximizes the opening angle between the QSO and emission-line galaxy 
sequences, measured between any two of the three leading projection dimensions. We find 
that a value of K = 23 provides the maximal discrimination but that values of K from 20 
to 27 are comparable. 

It should be noted that the final LLE projection is unique only up to a global 
translation and rotation. Because of this, the exact nature of the projection is highly 
dependent on the subset of data chosen, as well as the above-mentioned input parameters. 
What is invariant, however, is the relationship between each point and its neighbors. In 
particular, an outlying point in one projection will be an outlier in another projection. This 
is what makes LLE useful: it allows one to easily visualize the relationships between points 
in a high-dimensional parameter space. 

5.3. Fast LLE: efficient sampling strategies 

The LLE algorithm, even when optimized, is computationally expensive. The main 
bottleneck is the neighbor search, which, for a single point is 0[ND in ]. Using brute- force 
for all N points is, at worst, 0[N 2 D in \. While more advanced tree-based algorithms 
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can significantly improve on t his, it remains a com putational hurdle, especially for 
high-dimensional datasets (see 



Berchtold et al 



1998 



for a review of fast nearest-neighbor 



algorithms) . 



The second bottleneck is the computation of the global projection, which involves 
determining the bottom D out + 1 eigenvectors of a [N x N] sparse matrix (see appendix 
|C|) . By using iterative techniques such as Arnoldi/Lanczos Decomposition, it is possible 
to determine these eigenv ectors without performing a full matrix diagonalization (see 



Wright fc Trefethen 



2001 



for one well-tested optimized algorithm). 



Even fully optimized, it is not feasible to learn the LLE projection from the full 
SDSS spectral sample, which would amount to hundreds of thousands of points in a 
4000-dimensional dataspace. What is more, a random subsample of these data will be too 
sparsely populated in some regions of space (e.g. for strong line emitters that account for 
< 0.01% of a randomly sample of spectra from the SDSS), and too densely populated in 
others (e.g. around L* galaxies). In this case LLE will not sufficiently probe the entire 
structure of the manifold. For this reason, a strategy is needed to define a subsample 
that best spans the entire parameter space, without overpopulating the most probable 
neighborhoods. 



To do this, we follow 



de Ridder &: Duinl (120021 ) and appeal to the fundamental 



assumption of local linearity on the manifold. For a given point Xi, the first step of LLE 
is to compute the weights required to construct it from its K neighbors. The procedure 
outlined in section 15.21 for determining the intrinsic dimensionality can be used to find 
the local reconstruction error: i.e., the percentage of total variance not captured in the 
.Dj n -dimensional projection. Points for which this local projection error is small add very 
little information to the overall projection. They can be discarded without much effect. 
Points for which this reconstruction error is large are not well-approximated by a linear 
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combination of neighbors, and, therefore, contain information not present in the local 
best-fit hyperplane. If they are thrown out, the projected manifold will change significantly. 

With this in mind, the following strategy was used to reduce the ~ 170, 000 unique 
spectra to a manageable 9, 000: 

1. Divide the sample into groups of 2000 spectra each. 

2. For each group: 

(a) find the eigenvalues of the local covariance matrix of each point, and create a 
cumulative distribution based on the local reconstruction error. 

(b) randomly select 20% of the points based on this cumulative distribution, such 
that those with the largest reconstruction error are preferentially selected. 

3. merge every 5 subsamples into a new sample of 2000 spectra 

4. repeat this procedure until the total subsample is of the desired size. 

5.4. Applying LLE to new data sets 

Once the projection is determined for a given training sample, it is straightforward 
to determine the projection of new data onto the training surface. For each new point, 
the K nearest neighbors are determined within the training data. Weights are determined 
by minimizing the form of Equation [2J These weights are then used within Equation H] to 
determine the projection of the point. Note that this second step amounts to nothing more 
than a weighted sum of the projected neighbors. This is a key point in the application of 
LLE: once a projection is defined for a representative training sample, the computational 
cost of the classification of a set of test-points is largely due to a i^-nearest-neighbor search 
within the training set. This can be done optimally in 0[D in log N]. 
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Reconstructing unknown data from projected coordinates requires the reverse of the 
above procedure. Weights are determined by minimizing the corollary of Equation [2] within 
the projected space, and these weights are used to reconstruct the point in the original data 
space. 



6. Discussion 

As we have shown in section HI LLE provides an efficient and automated way of 
classifying high dimensional data (in our case galaxy spectra) that can account for inherent 
non-linearities within these data. For spectroscopic observations LLE can jointly classify 
spectra based on their emission-line and continuum properties which makes it well suited to 
the task of automated classification for large spectroscopic surveys. In the following section 
we compare the output from the LLE classifications with those from standard spectral 
classification techniques and discuss how LLE might be adopted for the next generation of 
spectroscopic surveys. 



Figure [IT] shows the PCA mixing-angle plot using the publicly available 



Yip et al. 



(]2004al ) SDSS galaxy eigenspectra (see Figure 8 in that work). The average 



sequenc es from Figures [3H3 are over-plotted for co mparison. As discusse d by 



Connolly et al. 



spectral 



Yip et al 



19951 ) or star formation 



(l2004al ). 4>kl correlates with spectral type (see also 
rate. It is, however, not well-suited to distinguishing between various types of line emission: 
broad-line QSOs, narrow-line QSOs, and emission-line galaxies overlap the same region of 
the projection space. This is a reflection of the fact that PCA is sensitive to continuum 
information rather than to line emission. Accounting for line emission in a PCA framework 
requires extracting the emission lines and applying PCA to the line equivalent widths 



( IGyory et al. 



20081 ). 
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Fig. 11. — Principal component mixing-angle diagram for the SDSS spectra subsample. The 
labeled tracks correspond to average spectra plotted in figures HUH See caption of Figure [2] 
for description of colors. 
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Figure [T2l shows the typical line-ratios used to distinguish between star-for ming galaxies 



and n arrow-line QSOs. The dividing lines between the populations are due to 



Kewley et al. 



(120011 ) . The tracks on the plot correspond to the average spectra in Figures [3] and (Note 



that line-ratio diagrams do not meaningfully cl assify broad-li n e QS Os, which uniformly 



Kewley et al 



(]200ll ) use stellar synthesis 



occupy the entire space of the line-ratio plots), 
models to show that emission characteristics of the objects below the dividing line can be 
attributed to ionization due to young stellar populations, with the ionization parameter q 
increasing toward the upper- left. This gives a physical interpretation of the emission galaxy 
(E) sequence of Figure [3l increasing star formation from El to E5. 

The region above the div iding line , there fore, is due to ionization which cannot be 



attributed to star formation. 



Ho et al 



(119971 ) show empirically that this region can be 



subdivided into transitional objects near the dividing line, LINERs which lie near N1-N2, 
and Seyfert-2 galaxies which lie near N3-N5. The narrow-line QSO (N) sequence of Figure 
[5] traces this transition. 

The above discussion points to the usefulness of LLE for automatic classification, 
without requiring the extraction of line parameters from the spectroscopic data. As with 
most classification schemes, the bulk of the work happens in the training of the LLE. Once 
the projection parameters (Section [5.21) and training subsample (Section [5.31) are chosen, the 
resulting projection provides the basis for a unique, repeatable classification scheme, with 
computational cost dominated by a i^-nearest neighbor search within the training sample 
(Section 15.4)) . As such LLE can provide an accurate initial classification and anomaly 
detection step for any automated spectroscopic pipeline, requiring only the extraction of 
spectroscopic redshifts. 

LLE is not limited to spectroscopic information. In principle, LLE can be used to 
visualize any type of data, even non-homogeneous data sets, e.g. a combination of fluxes, 
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Log[[S Ill/Ha) 



Fig. 12. — Line-ratio diagram of the objects in the subsample which display 3<r emission 
with a width < 1200km. Points above the cutoff are considered narrow-line QSOs, while 
points below the cutoff are considered emission-line galaxies. Note that broad-line QSOs are 



not meaningfully classified by these diagrams, and are omitted. Lines show t 
between narrow-line QSOs and star-forming galaxies given in iKewley et al. 



re boundaries 



(1200 lh . The 



tracks labeled E and N are the average spectra from Figures [3] and El respectively. 
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redshifts, angular size, etc. Care must be taken, though, to make sure that non-homogeneous 
dimensions contribute equally to the nearest neighbor search. This can be assured through 
modification of the distance metric for the nearest-neighbor search such that the distance 
between point x and point y are 



where W{ is a weight associated with measurement dimension i. For example, if the 
measurements in question are fluxes that range from to 1000 and redshifts that range from 
to 1, Wi could be set to reflect the parameter range: Wi = [max(xj) — min(xj)] -1 . This 
normalization of the data prior to the LLE step is common to any dimensional reduction 
technique, including PCA. 

The computational cost of learning the LLE required that we develop a subsampling 
technique that preserves the complexity of the underlying data (i.e. as opposed to a 
random sampling that will be dominated by the typical sources within a population). 
Neighborhood-based subsampling is important in two key areas: it yields a subsample 
which fills the entire observational parameter space, and the data are uniformly sampled 
throughout. Furthermore, the local nature of the process means that subtle nonlinear 
effects are not obscured in the process. This subsampling technique, described in Section 
15.31 is broadly applicable, regardless of the classification scheme being employed. It could 
be applied to any problem that requires sampling of a large data set - such as spectroscopic 
follow-up of optical or radio selected galaxies or spectroscopic observations of galaxies for 
training galaxy photometric redshifts. 




(6) 
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7. Conclusions 

Fast classification of large data sets is an increasingly present problem in astronomy. 
With this paper we show Locally Linear Embedding to be a powerful tool in the classification 
SDSS spectra that is sensitive to both low-frequency and high-frequency information. As 
such, is an improvement over standard techniques such as PCA/KL decomposition and 
line-ratio diagnostics. LLE is still in its infancy, and more understanding is needed to take 
into account measurement uncertainties, missing data, and undersampled manifolds. We 
discuss the limitations of LLE due to the computational cost of operating on large data 
sets and how these can be addressed through intelligent sampling of the training data. Our 
sampling technique is based on local information content and yields a training subsample 
which preserves the nonlinear traits of the sample as a whole. As a dimensionality reduction 
technique, LLE is both fast and accurate, and sensitive to all the information contained 
in measurement data. Because of this, it has potential to become a useful step in efficient 
classification of large astronomical data sets. 



LLE software used in this paper is publicly available at http: / / ssg.astro.washington.edu/software 
(see appendix |E|) . 



We thank S. Anderson, S. Krughoff, A. Becker, and T. Budavari for helpful comments 
and conversations. We thank the anonymous referee for several helpful suggestions. AJC 
is partially supported through NSF award 0851007. JTV is partially supported through 
NASA award NNX07AH07G and NSF award 0851007. 
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A. Notational Conventions 



Throughout this paper, matrices are denoted by bold capital letters (X, Y, etc.), 
vectors are given by bold lowercase letters (x, y, etc.), while scalar values are in normal-face 
type (X, y, A, etc.). Subscripts reference the individual elements of matrices, such that 
Xij is the element in the ith row and jth column of matrix X. The zth column vector of 
matrix X is given by Xi. Note the potential for confusion with this convention: (xj)j = X^. 
To limit this source of confusion, individual elements of a matrix will always be referenced 
via the latter formulation. I is the square identity matrix, such that 1^ = 5ij, where 8^ is 
the Kroniker delta function. is the zero matrix, such that 0^ = for all i and j. The 
sizes of these two matrices should be clear from the context in which they are u sed. The 
derivations in the f ollowi ng sections are based on those found in 



Roweis fc Saul! (120001 ) and 



de Ridder fc Duinl (120021 ). with the addition of a few clarifying details. 



B. Determining the Optimal Weights 



The optimal weights are determined by minimizing the reconstruction error given 
in equation [2j 



£f (w«) 



K 



This can be straightforwardly solved in closed form, using the method of Lagrangian 



(Bl) 



multipliers. Consider a neighborhood about the point Xi, with weights Wj that satisfy 

K 

1. (B2) 



5>1 



(0 



This constraint is useful because it imposes translational invariance to the projection. Using 
this condition, the above sum can be rewritten 



K 



3=1 



(B3) 



-41 - 



Now defining the local covariance matrix 



C Jk = ( X i" X n«) ( X i- X «« 



(B4) 



K K 



We find 

i=i fc=i 

We can minimize this by applying a Lagrange multiplier Aj to enforce the constraint given 



*(0 
ik 



(B5) 



in equation 



K K 



^(w w ) = EEW + ^( 1 -E 

i=i fe=i j 



(0 



(B6) 



Minimizing (w^) with respect to gives the condition of minimization: 



C«w« = Ail 



(B7) 



where 1 = [1, 1, ...1] T . Defining R.W to be the inverse of the local covariance matrix C^, 
we are left with 

w « = A,R W 1 (B8) 



Now imposing the constraint (IB2I) leads to 



v v 



EE 4 

j=i fc=i 



(B9) 



Evaluation of with equations IB8I and IB9I requires explicit inversion of the local 
covariance matrix. In practice, a more efficient method of calculation is to solve a variation 
of equation IB71 

C«w« = [1,1,1. ..1] T , (BIO) 

and then rescale the resulting such that the constraint in equation IB2I is satisfied. 

In general, the local covariance matrix may be nearly singular. In this case, 
the weights are not well defined. Ide Ridder fe Duinl ( 20021 ) note that the matrix can 
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be regularized based on its eigenvalues Because, in the end, we are doing a global 



projection to a dimension D out , we can follow probabilistic PCA and define 



1 



K 



r = 



K -D, 



out 



(Bll) 



j=D out +l 



i.e., the mean of the eigenvalues of the unused eigenvectors, and regularize our matrix as 



CW = Qw 4- rl. In practice, this would require an explicit eigenvalue decomposition for 
each local covariance matrix, which is very computationally expensive. We can obtain a 
similar regularization parameter by noting that ^ Aj = tr(CW). If our D ouf -dimensional 
projection contains the majority of the variance (which, under the assumptions of the 
algorithm, it should), we see that r will be very small compared to the trace of C®. For 
this paper, the value r = (10 _3 )tr(C^) was used. 



To simplify this, we define a sparse weight matrix W with columns populated by the weight 
vectors w^, such that 



C. Determining The Optimal Projection 



The projection error is given by equation HJ 




(CI) 




(C2) 



The projection error can then be rewritten in a more compact form: 



2 



£ 2 (Y) = Y - YW 



(C3) 



It can be shown that this is equivalent to the quantity 




(C4) 
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where we have defined 

M = (I — W)(I — W) T . (C5) 

It is clear that equation IC4I is trivially minimized by Y = 0. In order to find solutions 
of interest, we will impose the constraint 

(YY T ) = I. (C6) 

That is, the rows of Y are orthonormal. We can then use Lagrangian multipliers Aj to 
constrain equation IC4[ 

£ 2 (Y) = tr(YMY T + A(I - YY T )) (C7) 

where A is the diagonal matrix of Lagrangian multipliers, with A& = Aj. Minimizing this 
with respect to Y gives the condition 

MY T - Y T A = 0, (C8) 

Denoting the rows of Y with the vectors y®, this can be written in a more familiar form: 

My w - A iy (i) = 0. (C9) 

We see that Aj and yW are simply the eigenvalues and eigenvectors of the symmetric- real 
matrix M. Equation IC8I can then be rewritten YMY T = A and our reconstruction error 
0C4p is simply 

£ 2 (Y)=tr(A) = ^A f . (CIO) 

i 

Evidently, the desired projection is contained in the eigenvectors of M corresponding 
to the smallest eigenvalues. Because of the constraint in equation [B2\ it is clear that 
y(°) oc [1, 1, 1, ...1] T is a solution with eigenvalue Ao = 0. This solution amounts to a 
translation in space, and can be neglected. 
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Thus, we see that the .Do^-dimensional projection Y of a data matrix X is given 
simply by the (D out + l)-dimensional null-space of matrix M = (I — W)(I — W) r , with the 
lowest eigenvector neglected. 

Fortunately, a full matrix diagonalization is not necessary to find only a few extreme 
eigenvectors. Iterative methods such as Arnoldi decomposition or Lanczos factorization can 
be used to determine this null-space rather efficiently What is more, though the matrix 
I — W is a sparse [N x N] matrix, with only K + 1 nonzero entries per column. Thus, 
the LLE algorithm has large potential for optimization in both computational and storage 
requirements. 

D. LLE Algorithm 

Here is a description of the basic LLE algorithm, taking into account the above 
discussion. Given N points in D in dimensions, to be projected to D out dimensions, using 
the K nearest neighbors of each point: 

1. for i = 1 -> N 

(a) find the K nearest neighbors of point % 

(b) construct the local covariance matrix, (eqn. IB4j) 

(c) regularize the local covariance matrix by adding a fraction of its trace to the 
diagonal: = C^' + S ■ tr(C^)I. Choosing 5 = 10~ 3 works well in practice 
(See appendix IBl) 

(d) determine by solving equation IB10I 

(e) use w' 1 ' to populate column i of the weight matrix W, with Wji = if point j is 
not a neighbor of point i. 
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2. Construct the matrix M = (I - W)(I - W) T 

3. Determine the D out + 1 eigenvectors of matrix M corresponding to the smallest 
eigenvalues. The LLE projection is given by these, neglecting the lowest (which is 
merely a translation). 



E. Description of DimReduce code 

DimReduce is a C++ code which performs LLE and some of its variants on large 
datasets. To handle the computationally intensive pieces of the algorithm, it includes a 
basic interface to the optimized FORTRAN routines in BLAS, LAPACK, and ARPACK. 

The DimReduce interface works with data saved in the FITS format, a commonly 
used binary file format used for astronomical images and data. Four variants of LLE are 
available: 



LLE: This is the basic LLE algorithm. The user must provide input data, the number of 
nearest neighbors, and either the output dimension or a desired variance to preserve 
within each neighborhood. 

HLLE: This is a variant of LLE which uses a Hessian estimator within each neighborhood 
which better prese r ves a ngles in each neighborhood. The algorithm is based on 



Donoho fc Grimed (120031 ). Inputs are similar to LLE, with the added restriction that 



K must be larger than Di n , the input dimension. 



RLL E1: T his is a robust variant of LLE, based on an algorithm outlined in 



Chang fc Yeung 



(120061 ) . To detect outliers, an iteratively reweighted PCA is performed on each 
neighborhood. A reliability score is assigned to each point based on the number of 
neighborhoods in which it appears, as well as its iteratively determined contribution 
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within each neighborhood. The user provides inputs similar to those of LLE, as well 
as an reliability cutoff: a number between zero and one that represents the fraction of 
points believed to be outliers. 



Chang fc Yeungj (120061 ) . where 



RLLE2: This is another robust variant of LLE outlined in 

outliers are handled on a neighborhood by neighborhood basis. Reliability scores 
are computed as described above, and within each neighborhood the nearest K + R 
nearest neighbors are found, and then the R neighbors with the lowest reliability 
scores are discarded. The user provides inputs similar to those of LLE, as well as the 
number R of excess neighbors to find within each neighborhood. 

LLEsig: This is a utility routine which can be used in the data selection procedure 

described in section 15 .31 Given an input matrix, a number of nearest neighbors, and 
an output dimension, this will compute the local linear reconstruction error for each 
point. The result is the local reconstruction error for each point. 



In addition, a python utility is provided to plot and examine the results, using the open 
source packages scipy and matplotlib. All of the source code is available on the web£j, 
along with test data and example scripts. 

There are a few possible optimizations to DimReduce which have not yet been 
implemented. First of all, the sparse weight matrix is presently stored in dense format. 
Storage requirements and execution time could be greatly reduced by taking advantage 
of its sparsity. Secondly, the i^-nearest neighbor search is presently performed using a 
non-optimized brute-force algorithm. Use of a ball-tree, cover-tree, or similar algorithm 
would greatly speed up the performance, especially when projecting new test data onto 
a training surface. Thirdly, the entire package could be fairly easily restructured to take 



3 http: / / ssg.astro.washington.edu/software 
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advantage of parallel computing capabilities. Parallelized versions of BLAS, LAPACK, 
and ARPACK are all publicly available, and the computation of nearest neighbors and 
construction of the weight matrix lends itself easily to parallelization. Implementation of 
these improvements, along with the use of the data sampling strategies outlined in section 
15.31 give LLE the possibility of being used for automated classification of objects in a large 
variety of contexts. 
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